Skip to content

Stream tojson#15

Merged
thaliaarchi merged 5 commits intowader:masterfrom
thaliaarchi:tojson-stream
Mar 8, 2024
Merged

Stream tojson#15
thaliaarchi merged 5 commits intowader:masterfrom
thaliaarchi:tojson-stream

Conversation

@thaliaarchi
Copy link
Collaborator

@thaliaarchi thaliaarchi commented Mar 8, 2024

I have refactored _tojson to stream strings and reuse concatenations. It is now significantly faster, and has fully recovered from the performance regression when I added colors in #14.

Since jq is buffered by default, I saw no penalties from streaming tiny strings instead of concatenating into larger chunks like lines. There may be a sweet spot for chunk sizes that would suit jq --unbuffered and gojq, but I didn't benchmark it carefully like the rest.

Benchmarks

The . benchmark exercises the streaming behavior and the tojson benchmark collects it into an array and joins it. Notice that streamed (tojson) is now on faster than before-colors (tojson), so that performance loss has been fully regained. before-colors (.) is actually deceptive, because that revision dumped using host tojson instead of _tojson. There's no good comparison for ., but it seems to be as expected, matching tojson.

  • streamed: d6d9e4e (Fold newline into indent prefix, 2024-03-08), i.e., this PR
  • master: d0fdc43 (Make eval work in eval, 2024-03-08), i.e., origin/master at the time of this PR
  • master-passthrough: d0fdc43 (Make eval work in eval, 2024-03-08) with _tojson stubbed with host tojson
  • after-colors: 0532d9c (Fix --help with --join-output, 2024-02-07), i.e., the final commit related to color support
  • before-colors: 8a02423 (Add query? shorthand for try query catch empty, 2024-01-30), i.e., the commit before I added color support
Command Mean [s] Min [s] Max [s] Relative
streamed/jqjq . < large.json 2.004 ± 0.132 1.845 2.222 1.69 ± 0.13
streamed/jqjq tojson < large.json 2.124 ± 0.148 1.957 2.471 1.80 ± 0.15
master/jqjq . < large.json 3.085 ± 0.152 2.870 3.364 2.61 ± 0.17
master/jqjq tojson < large.json 2.630 ± 0.052 2.560 2.702 2.22 ± 0.10
master-passthrough/jqjq . < large.json 1.436 ± 0.153 1.319 1.748 1.21 ± 0.14
master-passthrough/jqjq tojson < large.json 1.273 ± 0.041 1.200 1.337 1.08 ± 0.06
after-colors/jqjq . < large.json 2.926 ± 0.266 2.715 3.472 2.47 ± 0.25
after-colors/jqjq tojson < large.json 2.480 ± 0.045 2.418 2.569 2.10 ± 0.10
before-colors/jqjq . < large.json 1.183 ± 0.050 1.127 1.260 1.00
before-colors/jqjq tojson < large.json 2.213 ± 0.076 2.120 2.321 1.87 ± 0.10

Generated with:

#!/usr/bin/env bash
set -eEuo pipefail

mkdir bench
cd bench
git worktree add streamed           d6d9e4e2524baadd3322cb2d0979f76a7af5b03e
git worktree add master             d0fdc43b71d285198ab4b712a1c3eb7554563d7c
git worktree add master-passthrough d0fdc43b71d285198ab4b712a1c3eb7554563d7c
git worktree add after-colors       0532d9c7e08518ca90812021da5164957b2f937b
git worktree add before-colors      8a02423cabb05a4467b27738d3add579822d3d14

# Stub _tojson to replace it with host tojson
patch -p1 -d master-passthrough <<EOF
diff --git a/jqjq.jq b/jqjq.jq
index 2500cc6..f7f2c85 100644
--- a/jqjq.jq
+++ b/jqjq.jq
@@ -1220,7 +1220,8 @@ def _tojson(\$opts):
   | _f(\$o; \$o.indent * " ")
   | if type == "array" then flatten | join("") end
   );
-def _tojson: _tojson({});
+def _tojson(\$opts): tojson;
+def _tojson: tojson;

 def undefined_func_error:
   error("undefined function \(.name)");
EOF

git clone --depth=1 https://github.com/wspace/corpus
find corpus -name project.json | sort | xargs cat > large.json

hyperfine \
  -L filter .,tojson \
  -L worktree streamed,master,master-passthrough,after-colors,before-colors \
  --export-json bench.json \
  --export-markdown bench.md \
  '{worktree}/jqjq {filter} < large.json'

# cd ..
# rm -rf bench
# git worktree prune

This directly streams strings from _tojson_stream and avoids string
concatenation and flattening arrays, to lower memory usage. Furthermore,
the ANSI color, stream separator, and indentation constructions are
moved earlier.
Now, color and output options affect the formatting of values and
--no-builtins affects the environment in the REPL.
Copy link
Owner

@wader wader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! thanks

I wonder if it's soon time for some tojson tests

end
| if $opts.raw_no_lf | not then ., "\n" end
| if $opts.raw_output0 then ., "\u0000" end
| dump($opts)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_filter/1 looks quite neat now!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just like how it was before color support! Just with dump($opts) instead of tojson haha

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha hehe, re-neated, good!

@thaliaarchi
Copy link
Collaborator Author

Yeah, it could probably use some of those. jq's tests which cover colors are written in bash, so it would entail writing more bash 😈. It might be possible in jq even without the dlopen branch, but I haven't thought about it.

Good thing you mentioned it, though. I ran my local test from when I was fixing reset consistency with jq colors, and found a bug, which has now been patched.

@thaliaarchi thaliaarchi merged commit 8632088 into wader:master Mar 8, 2024
@wader
Copy link
Owner

wader commented Mar 8, 2024

I wonder if we could extend the test format somehow to make it nicer new lines in the output and maybe escaped hex etc? maybe some kind of "raw" string mode for outputs?

@wader
Copy link
Owner

wader commented Mar 8, 2024

👍 i try to add tests even for the most trivial things in my projects, i just don't trust myself :)

@thaliaarchi
Copy link
Collaborator Author

thaliaarchi commented Mar 8, 2024

Or there could be a flag to fromjson the .test output to get the raw output? Then we could write the expected output with escapes.

@thaliaarchi thaliaarchi deleted the tojson-stream branch March 8, 2024 11:16
@wader
Copy link
Owner

wader commented Mar 8, 2024

Or there could be a flag to fromjson the .test output to get the raw output? Then we could write the expected output with escapes.

Could you give an example?

My aim is mostly to not end up with tests where the expected output is one-liner messes, i'm thinking something:

"aaa\n\bbb\nccc\n", "hello"
null
"""
aaa
bbb
ccc
"""
"hello"

@thaliaarchi
Copy link
Collaborator Author

I think multiline strings would solve part of it (though since the closing delimiter starts a line, we'd need to ignore the LF before it, so we could differentiate text with a trailing LF and without).

The bigger problem I see is that we have no way to configure the CLI options passed. Perhaps a directive like !options --color-output --raw-output0, which is invalid jq syntax, could be the first thing in the test case and supply the CLI invocation.

@wader
Copy link
Owner

wader commented Mar 8, 2024

I think multiline strings would solve part of it (though since the closing delimiter starts a line, we'd need to ignore the LF before it, so we could differentiate text with a trailing LF and without).

Yeap!

The bigger problem I see is that we have no way to configure the CLI options passed. Perhaps a directive like !options --color-output --raw-output0, which is invalid jq syntax, could be the first thing in the test case and supply the CLI invocation.

That would be great. for fq i made the test format be more or less a "transcript" so that tests looks exactly like normal shell invocations, ex https://github.com/wader/fq/blob/master/pkg/interp/testdata/debug.fqtest, that has worked very well. Also makes it quite easy to automatically rewrite expected with new actual output. (Just don't look at the implementation, is a less)

@thaliaarchi
Copy link
Collaborator Author

I really like the format in your fq tests. Using $ as the prefix doubles as invalid jq syntax, so it couldn’t be confused. I’d only change it to add an LF between tests.

Is the hacky part of it the quote handling? I was annoyed recently with fromjson not interacting nicely with single quotes. Maybe it could be done by replacing " with something from the Unicode private use area, then ' with ", passing it to fromjson, then restoring ".

@wader
Copy link
Owner

wader commented Mar 8, 2024

I really like the format in your fq tests. Using $ as the prefix doubles as invalid jq syntax, so it couldn’t be confused. I’d only change it to add an LF between tests.

Yes might be good idea, it can feel a bit cramped now

Is the hacky part of it the quote handling? I was annoyed recently with fromjson not interacting nicely with single quotes. Maybe it could be done by replacing " with something from the Unicode private use area, then ' with ", passing it to fromjson, then restoring ".

The thing that caused a bit of problem in the beginning was how to split the different tests when you only the the prompt to split on, is especially with the REPL support. At the moment the transcript parser is mess of balanced regex :) https://github.com/wader/fq/blob/master/internal/script/script.go#L467 :shame:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants